Introduction to Open Data Science - Course Project

About the project

date()
## [1] "Mon Nov 16 22:41:35 2020"

The text continues here.

What to expect from this course? The truth is, I don’t know. I have some previous background in programming, mostly with Python, but I have never studied data analysis in any way.

From this course, I expect to learn the data analysis workflow and R programming.

Link to my GitHub repository for this course: https://github.com/JariRintaaho/IODS-project


Week 2: Regression and model validation

date()
## [1] "Mon Nov 16 22:41:35 2020"
# The data is located on a server. The easiest way to get it is to use read.csv(url).

url="http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
learning2014=read.csv(url)

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The dataset learning2014 consists of 166 rows and 7 variables. The structure of the data is shown above. The dataset is handled using a data.frame structure in R. The data comes from the international survey of Approaches to Learning by Kimmo Vehkalahti.

# ggplot2 is an excellent library for plotting in R. With three lines of code, it is possible to plot practically everything you have ever wanted to see from a data file.

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

ggpairs(learning2014, mapping = ggplot2::aes(colour = gender), title = "Practically everything you want to see from a data set")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the ggpairs plot one can see that all the variables except points are more or less normally distributed. Gender does not have a dramatic effect on points: the distributions for males and females are very similar.
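To back up the visual impression about gender, one could also compare the points numerically (a quick sketch using base R):

# five-number summaries of exam points by gender
aggregate(points ~ gender, data = learning2014, FUN = summary)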

# The selected explanatory variables for the regression model are attitude, surf and stra. The target variable is points.

model = lm(points ~ attitude + surf + stra, data=learning2014) 

summary(model)
## 
## Call:
## lm(formula = points ~ attitude + surf + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## surf         -0.5861     0.8014  -0.731  0.46563    
## stra          0.8531     0.5416   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
# The summary shows that attitude is the only clearly significant explanatory variable; surf and stra have p-values well above 0.05, and the model explains about 21 % of the variance (multiple R-squared 0.207).
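Since surf and stra are not statistically significant, one could refit the model with attitude alone and compare the fit (a quick sketch; model2 is a name introduced here):

# refit with the only clearly significant explanatory variable
model2 = lm(points ~ attitude, data = learning2014)
summary(model2)$r.squared  # compare with the full model's multiple R-squared of 0.207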

The plots: Residuals vs Fitted values, Normal Q-Q plot and Residuals vs Leverage.

par(mfrow = c(2,2))
plot(model, which=c(1,2,5))

Based on the diagnostic plots, the residuals show no clear pattern against the fitted values and the Q-Q plot is close to a straight line, so the linear model assumptions seem reasonable.
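As a numeric complement to the Q-Q plot, one could also run a formal normality test on the residuals (a sketch; shapiro.test() is base R):

# Shapiro-Wilk test of the residuals: a large p-value is consistent with normality
shapiro.test(residuals(model))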


Week 3: Logistic regression

date()
## [1] "Mon Nov 16 22:41:40 2020"

Part 2: Downloading the data

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt" 

data=read.csv(url, sep=",")

A glimpse of the data:

glimpse() is much nicer than summary() here: at the moment, we are not interested in summary statistics.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
glimpse(data)
## Rows: 382
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", ...
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "ser...
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other...
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation...
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "...
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye...
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
## $ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",...
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", ...
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", ...
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
## $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, ...
## $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 1...
## $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 1...
## $ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, ...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...

The data has 382 rows and 35 columns: several numerical columns, several categorical columns and even one logical column. The data concerns students’ alcohol consumption: alc_use is the average of weekday (Dalc) and weekend (Walc) consumption, and high_use is TRUE when alc_use is greater than 2.

Part 3: Hypothesis

The aim of this task is to analyze the alcohol consumption level (the logical variable high_use) using four parameters. The interesting parameters are freetime, famrel, goout and absences. I think it is quite obvious that students who have a lot of free time and go out often use more alcohol. According to scientific consensus, family relations affect alcohol consumption. And based on personal experience of alcohol consumption and studying, a high level of alcohol consumption causes absences.

Part 4: Histograms and relations with alcohol consumption

First, let’s draw four histograms, one for each variable.

# par(mfrow = c(2,2)) arranges the four histograms in a 2x2 grid
par(mfrow=c(2,2))
hist(data$freetime)
hist(data$famrel)
hist(data$goout)
hist(data$absences)

It seems that freetime and goout are distributed in a more or less Gaussian way. Most of the students have an average amount of free time and go out an average amount.

The family relation distribution is skewed: most of the students rate their family relations from 3 to 5, but there are still some whose rating is 1 or 2.

The absences are also skewed. Nearly all of the students have fewer than 20 absences, but there seems to be at least one with more than 50.
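The right tail is easy to quantify (a quick sketch):

# the distribution of absences is strongly right-skewed
summary(data$absences)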

Next, let’s see what the box plots look like.

# ggplot 2 has some nice plot styles.
library(ggplot2)

# note: par(mfrow) has no effect on ggplot objects, so the four box plots are drawn separately
# (a package such as gridExtra could arrange them in a grid if needed)
g1 = ggplot(data, aes(x = high_use, y = freetime))
g1 = g1 + ggtitle("Student freetime by alcohol consumption")
g1 + geom_boxplot() + ylab("freetime")

g2 = ggplot(data, aes(x = high_use, y = famrel))
g2 = g2 + ggtitle("Student famrel by alcohol consumption")
g2 + geom_boxplot() + ylab("famrel")

g3 = ggplot(data, aes(x = high_use, y = goout))
g3 = g3 + ggtitle("Student goout by alcohol consumption")
g3 + geom_boxplot() + ylab("goout")

g4 = ggplot(data, aes(x = high_use, y = absences))
g4 = g4 + ggtitle("Student absences by alcohol consumption")
g4 + geom_boxplot() + ylab("absences")

Based on the box plots, free time seems to have no effect on alcohol consumption: in both groups (high users and low users) the amount of free time is practically identical.

Family relations seem to have an effect on alcohol consumption. However, for this parameter there are two outlier points: students who are not high users but have low scores in family relations.

It seems that going out is linked to being a heavy user of alcohol. Still, there is one outlier point: a student who goes out often but doesn’t use much alcohol.

Absences and being a high user seem to go hand in hand, but not always: there are several outlier points, students who have plenty of absences but are not high users.

Based on these findings, poor family relations and going out often seem to be strongly related to being a high user of alcohol. Absences also have some explanatory value. Contrary to the original hypothesis, the amount of free time does not seem to affect alcohol consumption.

Part 5

The logistic regression model can be built with a single command:

m <- glm(high_use ~ freetime + famrel + goout + absences, 
         data = data, family = "binomial")

The model summary is shown below. The Pr values of the variables show a clear difference: the smaller the Pr value of a parameter, the stronger the evidence that the parameter explains the target. The rule of thumb is that if Pr < 0.05, the parameter has value for the model. In this case, goout can be seen as an excellent parameter, and both absences and famrel have some value for explaining the high use of alcohol. Freetime seems to have no use for the model.

summary(m)
## 
## Call:
## glm(formula = high_use ~ freetime + famrel + goout + absences, 
##     family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9033  -0.7761  -0.5096   0.8915   2.3925  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.71107    0.68985  -3.930 8.50e-05 ***
## freetime     0.22413    0.13868   1.616 0.106054    
## famrel      -0.43185    0.13766  -3.137 0.001707 ** 
## goout        0.73875    0.12547   5.888 3.91e-09 ***
## absences     0.05893    0.01611   3.658 0.000254 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 385.67  on 377  degrees of freedom
## AIC: 395.67
## 
## Number of Fisher Scoring iterations: 4

Next, the coefficients, their odds ratios and the confidence intervals of the odds ratios. The raw coefficients do not tell the reader much, but the odds ratios, and especially their intervals, do. If the odds ratio of a parameter is exactly 1, the parameter has no explanatory value for the model. An odds ratio < 1 indicates a negative association and an odds ratio > 1 a positive one.

As one clearly sees, the 95 % confidence interval of the freetime parameter crosses the critical value 1. Therefore, one cannot say whether increased free time is associated with high use of alcohol or not.

As expected, family relations have a negative association: students with good family relations are less likely to be high users of alcohol.

Students who go out often are, without a doubt, more likely to be high users of alcohol. However, the positive association between being a high user and having many absences is only weak.

coef(m)
## (Intercept)    freetime      famrel       goout    absences 
## -2.71107006  0.22413401 -0.43185038  0.73875213  0.05893415
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI = confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR      2.5 %    97.5 %
## (Intercept) 0.06646565 0.01651175 0.2485937
## freetime    1.25123868 0.95490180 1.6469720
## famrel      0.64930652 0.49374170 0.8486899
## goout       2.09332169 1.64772445 2.6976795
## absences    1.06070539 1.02929793 1.0978106

Part 6

The predictions of the model can be calculated using the predict() function. The predicted values are stored in the same data frame as the actual data.

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to the data frame
data <- mutate(data, probability = probabilities)

# use the probabilities to make a prediction of high_use
data <- mutate(data, prediction = (probability > 0.5))

The requested 2 x 2 cross tabulation of predictions versus the actual values is below. One can easily see that the model’s prediction errors are not symmetrical: it tends to miss most of the high users and gives some false positive predictions.

# tabulate the target variable versus the predictions
table(high_use = data$high_use, prediction = data$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252   18
##    TRUE     68   44
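From the table one can also compute the overall share of incorrect predictions, (18 + 68) / 382, which is roughly 0.23. A small sketch (loss_func is a helper name introduced here):

# mean of incorrect predictions: treats high_use as 0/1 and counts
# predictions that land on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = data$high_use, prob = data$probability)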

Week 4: Clustering and classification

date()
## [1] "Mon Nov 16 22:41:42 2020"

Task 1

As you can see, this file has been created and uploaded to GitHub.

# In order to keep the Rmd file clean, all the libraries are loaded here. Keep scrolling.
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.84 loaded
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble  3.0.4     v purrr   0.3.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x MASS::select()  masks dplyr::select()
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Task 2

The data to be analyzed this week is called Boston and comes from the MASS package. It is a rather classic dataset: it contains data related to housing values in the suburbs of Boston and several other parameters, including air quality, crime rate and the proportion of people of color in those suburbs. The dataset is rather old; it was first published in 1978.

data("Boston")

The str() function is called to check the basic structure of the data. The data has 506 rows and 14 columns. The variable names and types are listed here:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Task 3

Here is the pairs plot of the dataset. Since the data has so many columns, it shows you practically nothing.

pairs(Boston)

It is better to use the corrplot() function to inspect the data set. The tidyverse and corrplot packages were loaded for this.

The correlations between different parameters are shown here. The larger and darker the circle is, the stronger the correlation is. Blue represents positive and red negative correlation.

cor_matrix<-cor(Boston) %>% round(digits=2)

corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

The histograms for each variable can be drawn using ggplot2. Traditionally, the data set has been used to demonstrate the effects of different parameters on housing values.

ggplot(gather(Boston), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x')

Task 4

In this task, the data set Boston is standardized and the crime rate is converted to a categorical variable.

As one can see, the mean value for each parameter is zero after scaling.

## First, the data set is scaled and summary is printed
boston_scaled <- scale(Boston)
boston_scaled=as.data.frame(boston_scaled)

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# Then the crime rate is converted to a categorical variable using the quantiles as break points.
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
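As a side note, scale() simply computes (x - mean(x)) / sd(x) column by column, which is why every mean in the summary above is zero. A quick check (sketch):

# the column means of the scaled data should be zero up to floating point error
colMeans(scale(Boston)) %>% round(digits = 10)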

The final part of this task is to split the scaled data set into train and test sets using an 80/20 split.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]

After the model is trained, its results on the test set give a less biased estimate of the actual model accuracy.
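Because sample() draws randomly, the split (and everything computed from it) changes on every knit. Fixing the seed before sampling would make the results reproducible (a sketch; the seed value is arbitrary):

# fix the random number generator state for a reproducible split
set.seed(2020)
ind <- sample(n, size = n * 0.8)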

Task 5

First, the LDA model is fitted to the data set. This time, the categorical crime rate is used as the target variable.

# These simple models can be applied using a single line of code.
lda.fit <- lda(crime ~ ., data = train)

A new function, lda.arrows(), is defined for this task. It draws the arrows of the LDA biplot.

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

One can see that it is actually relatively easy to separate the areas with a high crime rate.

Task 6

First, the correct classes are separated from the test set:

# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)

The rest of the test data is fed into the trained LDA model. The cross tabulated results are printed.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16      10        3    0
##   med_low    8      11        7    0
##   med_high   0      10       15    0
##   high       0       0        0   22

As the results show, the model predicts the high crime rate suburbs with nearly 100 % accuracy. However, it truly struggles when deciding whether a suburb belongs to the low or med_low category, and there are also some problems separating the med_low and med_high suburbs.
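The overall test set accuracy can also be computed directly from the predictions (a sketch; with the counts above it is (16 + 11 + 15 + 22) / 102, roughly 0.63, though the exact numbers vary with the random split):

# proportion of test set suburbs classified correctly
mean(correct_classes == lda.pred$class)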

Task 7

First, let’s reload and standardize the Boston data set:

data("Boston")
boston_scaled <- scale(Boston)
boston_scaled=as.data.frame(boston_scaled)

Second, let’s build a k-means clustering model with the scaled Boston data.

Based on the total within-cluster sum of squares (TWCSS) computed below, let’s select 2 as the number of clusters: the TWCSS drops sharply when moving from one to two clusters and much more slowly after that. Since a 14 x 14 pair plot is a bit difficult to look at, let’s draw three smaller ones.

k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <-kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled[1:5], col = km$cluster)

pairs(boston_scaled[5:10], col = km$cluster)

pairs(boston_scaled[10:14], col = km$cluster)

The pair plots show a nice separation into two groups. The results are quite similar to those of the LDA: with LDA we saw that it is relatively easy to find the suburbs with a high crime rate, but that there was significant overlap between the suburbs with low, med_low and med_high crime rates.
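To make the comparison with LDA concrete, the two k-means clusters can be cross tabulated against the crime categories created in Task 4 (a sketch; the crime factor is still in the workspace):

# how do the two clusters line up with the crime rate categories?
table(cluster = km$cluster, crime = crime)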

Bonus

First, let’s perform the k-means clustering using 3 centers.

km2 <-kmeans(boston_scaled, centers = 3)

Then, let’s try to fit the LDA to the new k-means model output.

# These simple models can be applied using a single line of code.
lda.fit2 <- lda(km2$cluster ~ ., data = boston_scaled)

Now, let’s look at the results. It seems that the model can detect the suburbs with a high crime rate using this unsupervised learning method.

# lda.arrows() was already defined in Task 5, so it can be reused here as is.

# target classes as numeric
classes <- as.numeric(km2$cluster)

# plot the lda results
plot(lda.fit2, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit2, myscale = 1)

Super bonus

The first part of the super bonus task:

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)


plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
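The second part of the super bonus would color the points by the crime classes of the train set; a sketch (plot_ly maps the color argument to the factor levels):

# the same 3D scatter plot, colored by the crime classes of the train set
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = train$crime)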

(more chapters to be added similarly as we proceed with the course!)